The BBN Byblos Hindi OCR system
Internal identifier: 001368 (Main/Exploration); previous: 001367; next: 001369
Authors: Prem Natarajan [United States]; Ehry Macrostie [United States]; Michael Decerbo [United States]
Source:
- SPIE proceedings series [1017-2653]; 2005.
French descriptors
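- Pascal (Inist): Reconnaissance optique caractère, Implémentation, Modèle Markov variable cachée, Arabe, Anglais, Chinois, Japonais, Apprentissage, Formation image, Traitement image document, Appareillage essai, Taux erreur, Approche probabiliste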
English descriptors
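- KwdEn: Arabic, Chinese, Document image processing, English, Error rate, Hidden Markov models, Imaging, Implementation, Japanese, Learning, Optical character recognition, Probabilistic approach, Testing equipment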
Abstract
The BBN Byblos OCR system implements a script-independent methodology for OCR using Hidden Markov Models (HMMs). We have successfully ported the system to Arabic, English, Chinese, Pashto, and Japanese. In this paper, we report on our recent effort in training the system to perform recognition of Hindi (Devanagari) documents. The initial experiments reported in this paper were performed using a corpus of synthetic (computer-generated) document images, along with slightly degraded versions of the same images that were generated by scanning printed versions of the document images and by scanning faxes of the printed versions. On a fair test set consisting of synthetic images alone, we measured a character error rate of 1.0%. The character error rate on a fair test set consisting of scanned images (scans of printed versions of the synthetic images) was 1.40%, while the character error rate on a fair test set of fax images (scans of printed and faxed versions of the synthetic images) was 8.7%.
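The abstract reports character error rates but does not say how they were computed. As an illustration only, the sketch below uses a common convention: the Levenshtein edit distance between the reference text and the OCR output, divided by the reference length. The function names are hypothetical, and treating each Unicode code point as one character is an assumption; the paper may count Devanagari units differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "a" * 100
    hypothesis = "a" * 99 + "b"  # one substituted character out of 100
    print(f"{character_error_rate(reference, hypothesis):.1%}")
```

Under this convention, one wrong character in a 100-character reference gives a rate of 1.0%, the same order as the figure reported for the synthetic test set.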
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000455
- to stream PascalFrancis, to step Curation: 000333
- to stream PascalFrancis, to step Checkpoint: 000375
- to stream Main, to step Merge: 001405
- to stream Main, to step Curation: 001368
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">The BBN Byblos Hindi OCR system</title>
<author><name sortKey="Natarajan, Prem" sort="Natarajan, Prem" uniqKey="Natarajan P" first="Prem" last="Natarajan">Prem Natarajan</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>BBN Technologies 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Macrostie, Ehry" sort="Macrostie, Ehry" uniqKey="Macrostie E" first="Ehry" last="Macrostie">Ehry Macrostie</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>BBN Technologies 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Decerbo, Michael" sort="Decerbo, Michael" uniqKey="Decerbo M" first="Michael" last="Decerbo">Michael Decerbo</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>BBN Technologies 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">05-0361372</idno>
<date when="2005">2005</date>
<idno type="stanalyst">PASCAL 05-0361372 INIST</idno>
<idno type="RBID">Pascal:05-0361372</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000455</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000333</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000375</idno>
<idno type="wicri:doubleKey">1017-2653:2005:Natarajan P:the:bbn:byblos</idno>
<idno type="wicri:Area/Main/Merge">001405</idno>
<idno type="wicri:Area/Main/Curation">001368</idno>
<idno type="wicri:Area/Main/Exploration">001368</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">The BBN Byblos Hindi OCR system</title>
<author><name sortKey="Natarajan, Prem" sort="Natarajan, Prem" uniqKey="Natarajan P" first="Prem" last="Natarajan">Prem Natarajan</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>BBN Technologies 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Macrostie, Ehry" sort="Macrostie, Ehry" uniqKey="Macrostie E" first="Ehry" last="Macrostie">Ehry Macrostie</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>BBN Technologies 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Decerbo, Michael" sort="Decerbo, Michael" uniqKey="Decerbo M" first="Michael" last="Decerbo">Michael Decerbo</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>BBN Technologies 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
<imprint><date when="2005">2005</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Arabic</term>
<term>Chinese</term>
<term>Document image processing</term>
<term>English</term>
<term>Error rate</term>
<term>Hidden Markov models</term>
<term>Imaging</term>
<term>Implementation</term>
<term>Japanese</term>
<term>Learning</term>
<term>Optical character recognition</term>
<term>Probabilistic approach</term>
<term>Testing equipment</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance optique caractère</term>
<term>Implémentation</term>
<term>Modèle Markov variable cachée</term>
<term>Arabe</term>
<term>Anglais</term>
<term>Chinois</term>
<term>Japonais</term>
<term>Apprentissage</term>
<term>Formation image</term>
<term>Traitement image document</term>
<term>Appareillage essai</term>
<term>Taux erreur</term>
<term>Approche probabiliste</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">The BBN Byblos OCR system implements a script-independent methodology for OCR using Hidden Markov Models (HMMs). We have successfully ported the system to Arabic, English, Chinese, Pashto, and Japanese. In this paper, we report on our recent effort in training the system to perform recognition of Hindi (Devanagari) documents. The initial experiments reported in this paper were performed using a corpus of synthetic (computer-generated) document images, along with slightly degraded versions of the same images that were generated by scanning printed versions of the document images and by scanning faxes of the printed versions. On a fair test set consisting of synthetic images alone, we measured a character error rate of 1.0%. The character error rate on a fair test set consisting of scanned images (scans of printed versions of the synthetic images) was 1.40%, while the character error rate on a fair test set of fax images (scans of printed and faxed versions of the synthetic images) was 8.7%.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Massachusetts</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Massachusetts"><name sortKey="Natarajan, Prem" sort="Natarajan, Prem" uniqKey="Natarajan P" first="Prem" last="Natarajan">Prem Natarajan</name>
</region>
<name sortKey="Decerbo, Michael" sort="Decerbo, Michael" uniqKey="Decerbo M" first="Michael" last="Decerbo">Michael Decerbo</name>
<name sortKey="Macrostie, Ehry" sort="Macrostie, Ehry" uniqKey="Macrostie E" first="Ehry" last="Macrostie">Ehry Macrostie</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001368 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001368 | SxmlIndent | more
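Once the record has been extracted with one of the commands above, its fields can be read with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the record arrives as a fragment in which the `wicri:` and `inist:` prefixes are not declared, as in the listing shown earlier, so it wraps the fragment in a root element that binds them to placeholder URIs. The URIs, the `summarize` function and the shortened sample record are hypothetical and not part of Dilib.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace bindings (assumption): the record uses the wicri:
# and inist: prefixes without declaring them, which a strict parser rejects.
WRAPPER = ('<root xmlns:wicri="urn:example:wicri" '
           'xmlns:inist="urn:example:inist">{}</root>')


def summarize(fragment: str) -> dict:
    """Return title, authors and English keywords from a record fragment."""
    root = ET.fromstring(WRAPPER.format(fragment))
    title = root.find(".//titleStmt/title")
    return {
        "title": title.text if title is not None else None,
        "authors": [n.text for n in root.findall(".//titleStmt/author/name")],
        "keywords": [t.text for t in
                     root.findall(".//keywords[@scheme='KwdEn']/term")],
    }


# Shortened sample with the same element layout as the record above.
SAMPLE = (
    '<record><TEI><teiHeader><fileDesc><titleStmt>'
    '<title xml:lang="en" level="a">The BBN Byblos Hindi OCR system</title>'
    '<author><name>Prem Natarajan</name>'
    '<affiliation wicri:level="2"><inist:fA14 i1="01">'
    '<s1>BBN Technologies 10 Moulton Street</s1></inist:fA14>'
    '</affiliation></author>'
    '</titleStmt></fileDesc>'
    '<profileDesc><textClass>'
    '<keywords scheme="KwdEn" xml:lang="en">'
    '<term>Arabic</term><term>Hidden Markov models</term></keywords>'
    '</textClass></profileDesc>'
    '</teiHeader></TEI></record>'
)

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Restricting the author query to `titleStmt` avoids counting each author twice, since the record repeats the author list under `sourceDesc`.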
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:05-0361372 |texte= The BBN Byblos Hindi OCR system }}
This area was generated with Dilib version V0.6.32.